1. Problem Statement

Londoners use a bicycle rental system for daily trips within the city, whether commuting to work or school or sightseeing on a pleasant summer day. The number of rented bicycles depends on factors such as air temperature, humidity, time of day, etc. In this project, we try to predict the number of bicycles rented in each hour of the day.

2. Data Understanding

The London bike sharing dataset available on Kaggle provides information about the bike sharing system in London. Each record corresponds to one hour of operation and gives the total number of bike shares in that hour, together with the conditions under which it was observed.

The dataset contains 17,414 records spanning from 2015-01-04 to 2017-01-03. Each record includes the weather conditions during that hour, such as temperature, humidity, wind speed, and a coded weather situation. It also indicates whether the hour fell on a public holiday or a weekend, which may have affected bike usage. The description of each variable in this data is provided below:

Also, cnt represents the label (the y value) our model must be trained to predict; the other columns are potential features (x values). Therefore, the total number of bike shares (cnt) is the response variable in this data, and the main purpose of this report is to investigate the effect of the independent variables on this response.

# Define the variables and their titles
variables <- c("timestamp", "cnt", "t1", "t2", "hum", "wind_speed", "weather_code",
               "is_holiday", "is_weekend", "season", "year", "month", "day", "hour")
titles <- c("Timestamp", "Count", "Temperature 1", "Temperature 2: Feel-like", "Humidity",
            "Wind Speed", "Weather Code", "Is Holiday", "Is Weekend", "Season",
            "Year", "Month", "Day", "Hour")

# Create a data frame with the variables and titles
variable_table <- data.frame(Variable = variables, Title = titles)

# Print the table
print(variable_table)
##        Variable                    Title
## 1     timestamp                Timestamp
## 2           cnt                    Count
## 3            t1            Temperature 1
## 4            t2 Temperature 2: Feel-like
## 5           hum                 Humidity
## 6    wind_speed               Wind Speed
## 7  weather_code             Weather Code
## 8    is_holiday               Is Holiday
## 9    is_weekend               Is Weekend
## 10       season                   Season
## 11         year                     Year
## 12        month                    Month
## 13          day                      Day
## 14         hour                     Hour

3. Data Pre-Processing and Cleaning

The data is loaded into R using the code below, and all the libraries required in this report are loaded as well:

library(dplyr)
library(skimr)
library(ggplot2)
library(lubridate)
library(tibble)
library(caret)
library(e1071)
library(rpart)
library(PerformanceAnalytics)
library(xgboost)
library(randomForest)

# set a theme for ggplots
theme_set(
  theme_bw(base_size = 15)+
  theme(plot.title.position = "plot")
)
# load our dataset
bike <- read.csv(file = "london_merged.csv", header = T)
# print the dimension of the data
dim(bike)
## [1] 17414    10

Get a summary of the data:

skim(bike)
Data summary
Name bike
Number of rows 17414
Number of columns 10
_______________________
Column type frequency:
character 1
numeric 9
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
timestamp 0 1 19 19 0 17414 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
cnt 0 1 1143.10 1085.11 0.0 257 844.0 1671.75 7860.0 ▇▂▁▁▁
t1 0 1 12.47 5.57 -1.5 8 12.5 16.00 34.0 ▂▇▇▂▁
t2 0 1 11.52 6.62 -6.0 6 12.5 16.00 34.0 ▂▅▇▂▁
hum 0 1 72.32 14.31 20.5 63 74.5 83.00 100.0 ▁▂▅▇▅
wind_speed 0 1 15.91 7.89 0.0 10 15.0 20.50 56.5 ▆▇▃▁▁
weather_code 0 1 2.72 2.34 1.0 1 2.0 3.00 26.0 ▇▁▁▁▁
is_holiday 0 1 0.02 0.15 0.0 0 0.0 0.00 1.0 ▇▁▁▁▁
is_weekend 0 1 0.29 0.45 0.0 0 0.0 1.00 1.0 ▇▁▁▁▃
season 0 1 1.49 1.12 0.0 0 1.0 2.00 3.0 ▇▇▁▇▇

To pre-process the data, we should pay attention to a few aspects of it, including the variable types, the number of missing values in each column, scaling of the variables (if necessary), and the exclusion of unused variables.

3-1 Extract the Year, Month, Day and Hour Variables from the Timestamp Column

# add year, month, day and hour columns derived from the timestamp
bike <- bike %>% mutate(
  year  = year(timestamp),
  month = month(timestamp),
  day   = day(timestamp),
  hour  = hour(timestamp))
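As a quick sanity check of these accessors (a minimal sketch on a single hand-written timestamp in the dataset's format, not part of the report's pipeline):

```r
library(lubridate)

# one timestamp in the same "YYYY-MM-DD HH:MM:SS" format as the data
ts <- ymd_hms("2015-01-04 09:00:00")

# the same accessor functions used inside mutate() above
year(ts)   # 2015
month(ts)  # 1
day(ts)    # 4
hour(ts)   # 9
```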

Take a closer look at our dataset:

summary(bike)
##   timestamp              cnt             t1              t2       
##  Length:17414       Min.   :   0   Min.   :-1.50   Min.   :-6.00  
##  Class :character   1st Qu.: 257   1st Qu.: 8.00   1st Qu.: 6.00  
##  Mode  :character   Median : 844   Median :12.50   Median :12.50  
##                     Mean   :1143   Mean   :12.47   Mean   :11.52  
##                     3rd Qu.:1672   3rd Qu.:16.00   3rd Qu.:16.00  
##                     Max.   :7860   Max.   :34.00   Max.   :34.00  
##       hum           wind_speed     weather_code      is_holiday     
##  Min.   : 20.50   Min.   : 0.00   Min.   : 1.000   Min.   :0.00000  
##  1st Qu.: 63.00   1st Qu.:10.00   1st Qu.: 1.000   1st Qu.:0.00000  
##  Median : 74.50   Median :15.00   Median : 2.000   Median :0.00000  
##  Mean   : 72.32   Mean   :15.91   Mean   : 2.723   Mean   :0.02205  
##  3rd Qu.: 83.00   3rd Qu.:20.50   3rd Qu.: 3.000   3rd Qu.:0.00000  
##  Max.   :100.00   Max.   :56.50   Max.   :26.000   Max.   :1.00000  
##    is_weekend         season           year          month       
##  Min.   :0.0000   Min.   :0.000   Min.   :2015   Min.   : 1.000  
##  1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:2015   1st Qu.: 4.000  
##  Median :0.0000   Median :1.000   Median :2016   Median : 7.000  
##  Mean   :0.2854   Mean   :1.492   Mean   :2016   Mean   : 6.515  
##  3rd Qu.:1.0000   3rd Qu.:2.000   3rd Qu.:2016   3rd Qu.:10.000  
##  Max.   :1.0000   Max.   :3.000   Max.   :2017   Max.   :12.000  
##       day             hour      
##  Min.   : 1.00   Min.   : 0.00  
##  1st Qu.: 8.00   1st Qu.: 6.00  
##  Median :16.00   Median :12.00  
##  Mean   :15.75   Mean   :11.51  
##  3rd Qu.:23.00   3rd Qu.:18.00  
##  Max.   :31.00   Max.   :23.00

3-2 Check for Number of Missing Values in the Data

In order to check for the number of missing values in each column (variable), the following code is used:

# count number of missing values in each column
nm <- colSums(is.na(bike))
# convert missing count information to a table
tibble(Variable=names(nm), Number_of_Missing=as.vector(nm)) %>% 
  knitr::kable()
Variable Number_of_Missing
timestamp 0
cnt 0
t1 0
t2 0
hum 0
wind_speed 0
weather_code 0
is_holiday 0
is_weekend 0
season 0
year 0
month 0
day 0
hour 0
non_missing_percentage <- colMeans(!is.na(bike)) * 100
non_missing_df <- data.frame(Variable = names(non_missing_percentage), Percentage = non_missing_percentage)

ggplot(data = non_missing_df, aes(x = Variable, y = Percentage)) +
  geom_bar(stat = "identity", fill = "steelblue", position = position_stack(vjust = 1)) +
  geom_text(aes(label = paste0(round(Percentage), "%")), vjust = 0.5) +
  labs(title = "Percentage of Non-Missing Values", x = "Variable", y = "Percentage") +
  coord_flip()

  • As we can see in the above table, there are no missing values in the data.
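To illustrate what colSums(is.na(...)) reports when missing values are present, here is a toy example (hypothetical data, not drawn from the bike dataset):

```r
# a small frame with two NAs in column b
toy <- data.frame(a = 1:4, b = c(2, NA, 5, NA))

# number of missing values per column
colSums(is.na(toy))
# a: 0, b: 2
```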
3-3 Identify Categorical and Numerical Variables

# Define categorical variables
categorical_vars <- c("weather_code", "is_holiday", "is_weekend", "season", "year", "month", "day", "hour")

# Create a tibble for categorical variables
categorical_table <- tibble(
  Categorical_Variables = categorical_vars
)

# Print the categorical table
print(categorical_table)
## # A tibble: 8 × 1
##   Categorical_Variables
##   <chr>                
## 1 weather_code         
## 2 is_holiday           
## 3 is_weekend           
## 4 season               
## 5 year                 
## 6 month                
## 7 day                  
## 8 hour
# Define numerical variables
numerical_vars <- c("cnt", "t1", "t2", "hum", "wind_speed")

# Create a tibble for numerical variables
numerical_table <- tibble(
  Numerical_Variables = numerical_vars
)

# Print the numerical table
print(numerical_table)
## # A tibble: 5 × 1
##   Numerical_Variables
##   <chr>              
## 1 cnt                
## 2 t1                 
## 3 t2                 
## 4 hum                
## 5 wind_speed
# Get the structure of the dataset
data_info <- str(bike)
## 'data.frame':    17414 obs. of  14 variables:
##  $ timestamp   : chr  "2015-01-04 00:00:00" "2015-01-04 01:00:00" "2015-01-04 02:00:00" "2015-01-04 03:00:00" ...
##  $ cnt         : int  182 138 134 72 47 46 51 75 131 301 ...
##  $ t1          : num  3 3 2.5 2 2 2 1 1 1.5 2 ...
##  $ t2          : num  2 2.5 2.5 2 0 2 -1 -1 -1 -0.5 ...
##  $ hum         : num  93 93 96.5 100 93 93 100 100 96.5 100 ...
##  $ wind_speed  : num  6 5 0 0 6.5 4 7 7 8 9 ...
##  $ weather_code: num  3 1 1 1 1 1 4 4 4 3 ...
##  $ is_holiday  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ is_weekend  : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ season      : num  3 3 3 3 3 3 3 3 3 3 ...
##  $ year        : num  2015 2015 2015 2015 2015 ...
##  $ month       : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ day         : int  4 4 4 4 4 4 4 4 4 4 ...
##  $ hour        : int  0 1 2 3 4 5 6 7 8 9 ...
# Extract the variable types (note: str() only prints its output and returns
# NULL invisibly, so sapply() over class() is used to capture the types)
variable_types <- sapply(bike, class)

# Create a data frame with variable names and types
variable_table <- data.frame(Variable = names(variable_types), Type = variable_types,
                             row.names = NULL, stringsAsFactors = FALSE)

# Divide variables into categorical (character) and numerical columns
categorical_vars <- variable_table$Variable[variable_table$Type == "character"]
numerical_vars <- variable_table$Variable[variable_table$Type != "character"]

# Create tables for categorical and numerical variables
categorical_table <- data.frame(Categorical_Variables = categorical_vars)
numerical_table <- data.frame(Numerical_Variables = numerical_vars)

# Print the tables
print(categorical_table)
##   Categorical_Variables
## 1             timestamp
print(numerical_table)
##    Numerical_Variables
## 1                  cnt
## 2                   t1
## 3                   t2
## 4                  hum
## 5           wind_speed
## 6         weather_code
## 7           is_holiday
## 8           is_weekend
## 9               season
## 10                year
## 11               month
## 12                 day
## 13                hour

At this stage only timestamp is stored as character; the variables listed as categorical above are still numeric and are converted to factors in the next subsection.

3-4 Define Variable Types

In the following codes, a suitable variable type is assigned to each variable based on the description provided in the previous sections.

# replace the values more than 4 in weather_code with 4
bike$weather_code[bike$weather_code > 4] <- 4
# variable type setting
# bike$timestamp <- as.Date(bike$timestamp)
bike$weather_code <- as.factor(bike$weather_code)
bike$is_holiday <- as.factor(bike$is_holiday)
bike$is_weekend <- as.factor(bike$is_weekend)
bike$season <- as.factor(bike$season)
bike$year <- as.factor(bike$year)
bike$month <- as.factor(bike$month)
bike$day <- as.factor(bike$day)
bike$hour <- as.factor(bike$hour)
# print the first 10 rows of the data
head(bike, n=10)
##              timestamp cnt  t1   t2   hum wind_speed weather_code is_holiday
## 1  2015-01-04 00:00:00 182 3.0  2.0  93.0        6.0            3          0
## 2  2015-01-04 01:00:00 138 3.0  2.5  93.0        5.0            1          0
## 3  2015-01-04 02:00:00 134 2.5  2.5  96.5        0.0            1          0
## 4  2015-01-04 03:00:00  72 2.0  2.0 100.0        0.0            1          0
## 5  2015-01-04 04:00:00  47 2.0  0.0  93.0        6.5            1          0
## 6  2015-01-04 05:00:00  46 2.0  2.0  93.0        4.0            1          0
## 7  2015-01-04 06:00:00  51 1.0 -1.0 100.0        7.0            4          0
## 8  2015-01-04 07:00:00  75 1.0 -1.0 100.0        7.0            4          0
## 9  2015-01-04 08:00:00 131 1.5 -1.0  96.5        8.0            4          0
## 10 2015-01-04 09:00:00 301 2.0 -0.5 100.0        9.0            3          0
##    is_weekend season year month day hour
## 1           1      3 2015     1   4    0
## 2           1      3 2015     1   4    1
## 3           1      3 2015     1   4    2
## 4           1      3 2015     1   4    3
## 5           1      3 2015     1   4    4
## 6           1      3 2015     1   4    5
## 7           1      3 2015     1   4    6
## 8           1      3 2015     1   4    7
## 9           1      3 2015     1   4    8
## 10          1      3 2015     1   4    9
# print the last 10 rows of the data
tail(bike, n=10)
##                 timestamp  cnt  t1  t2  hum wind_speed weather_code is_holiday
## 17405 2017-01-03 14:00:00  765 6.0 2.0 73.5         22            3          0
## 17406 2017-01-03 15:00:00  845 6.0 2.0 71.0         27            4          0
## 17407 2017-01-03 16:00:00 1201 6.0 2.0 71.0         26            4          0
## 17408 2017-01-03 17:00:00 2742 6.0 2.0 73.5         21            3          0
## 17409 2017-01-03 18:00:00 2220 5.0 1.0 81.0         22            2          0
## 17410 2017-01-03 19:00:00 1042 5.0 1.0 81.0         19            3          0
## 17411 2017-01-03 20:00:00  541 5.0 1.0 81.0         21            4          0
## 17412 2017-01-03 21:00:00  337 5.5 1.5 78.5         24            4          0
## 17413 2017-01-03 22:00:00  224 5.5 1.5 76.0         23            4          0
## 17414 2017-01-03 23:00:00  139 5.0 1.0 76.0         22            2          0
##       is_weekend season year month day hour
## 17405          0      3 2017     1   3   14
## 17406          0      3 2017     1   3   15
## 17407          0      3 2017     1   3   16
## 17408          0      3 2017     1   3   17
## 17409          0      3 2017     1   3   18
## 17410          0      3 2017     1   3   19
## 17411          0      3 2017     1   3   20
## 17412          0      3 2017     1   3   21
## 17413          0      3 2017     1   3   22
## 17414          0      3 2017     1   3   23

4. Descriptive Statistics

4-1 Data Summarization

# data summarization
skim(bike)
Data summary
Name bike
Number of rows 17414
Number of columns 14
_______________________
Column type frequency:
character 1
factor 8
numeric 5
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
timestamp 0 1 19 19 0 17414 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
weather_code 0 1 FALSE 4 1: 6150, 2: 4034, 4: 3679, 3: 3551
is_holiday 0 1 FALSE 2 0: 17030, 1: 384
is_weekend 0 1 FALSE 2 0: 12444, 1: 4970
season 0 1 FALSE 4 0: 4394, 1: 4387, 3: 4330, 2: 4303
year 0 1 FALSE 3 201: 8699, 201: 8643, 201: 72
month 0 1 FALSE 12 5: 1488, 1: 1487, 8: 1484, 12: 1484
day 0 1 FALSE 31 6: 576, 14: 576, 21: 576, 22: 576
hour 0 1 FALSE 24 16: 730, 12: 729, 15: 729, 13: 728

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
cnt 0 1 1143.10 1085.11 0.0 257 844.0 1671.75 7860.0 ▇▂▁▁▁
t1 0 1 12.47 5.57 -1.5 8 12.5 16.00 34.0 ▂▇▇▂▁
t2 0 1 11.52 6.62 -6.0 6 12.5 16.00 34.0 ▂▅▇▂▁
hum 0 1 72.32 14.31 20.5 63 74.5 83.00 100.0 ▁▂▅▇▅
wind_speed 0 1 15.91 7.89 0.0 10 15.0 20.50 56.5 ▆▇▃▁▁

The above results show that the London bike sharing data is summarised based on the variable type. In the following subsection, the data are visualized with more details.

4-2 Data Visualization

# histogram of cnt
ggplot(data=bike, aes(x=cnt))+
  geom_histogram(fill='cadetblue4', col='grey', bins = 30)+
  scale_x_continuous(n.breaks = 10)+
  scale_y_continuous(n.breaks = 10)+
  geom_vline(xintercept = mean(bike$cnt), colour='green', linetype='dashed', linewidth=2)+
  geom_vline(xintercept = median(bike$cnt), colour='red', linetype='dotted', linewidth=2)+
  labs(x='total number of bike shares in London',
       title = 'The histogram of total number of bike shares in London',
       subtitle = 'The green dashed line=mean(cnt) and the red dotted line=median(cnt)')

# boxplot of cnt
ggplot(data=bike, aes(x=cnt))+
  geom_boxplot(fill='cadetblue4')+
  scale_x_continuous(n.breaks = 20)+
  scale_y_discrete()+
  geom_vline(xintercept = mean(bike$cnt), colour='green', linetype='dashed', linewidth=2)+
  geom_vline(xintercept = median(bike$cnt), colour='red', linetype='dotted', linewidth=2)+
  labs(x='total number of bike shares in London',
       title = 'The boxplot of total number of bike shares in London',
       subtitle = 'The green dashed line=mean(cnt) and the red dotted line=median(cnt)')

  • Variation in Bike Shares: The histogram and box plot show that the total number of bike shares ranges from 0 to 7,860.
  • Central Tendency: The mean (green dashed line) and the median (red dotted line) sit toward the lower end of the range, indicating that most hours have relatively low bike share counts.
  • Majority of Data: Most of the data falls between 0 and approximately 1,800 bike shares, so this range contains the bulk of the dataset.
  • Outliers: The box plot flags values above the typical range of the data, shown as black points. These values are considered outliers due to their unusually high bike share counts.
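The outlier points in the boxplot follow the 1.5 × IQR rule used by geom_boxplot; a sketch reproducing the upper fence from the quartiles reported in the summary above (Q1 = 257, Q3 = 1671.75):

```r
# upper whisker fence behind geom_boxplot's outlier points
q1 <- 257
q3 <- 1671.75
iqr <- q3 - q1
upper_fence <- q3 + 1.5 * iqr
upper_fence   # 3793.875: hours with cnt above this are drawn as outlier points
```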
# histogram of t1
ggplot(data=bike, aes(x=t1))+
  geom_histogram(fill='cadetblue4', col='grey', bins = 30)+
  scale_x_continuous(n.breaks = 10)+
  scale_y_continuous(n.breaks = 10)+
  geom_vline(xintercept = mean(bike$t1), colour='green', linetype='dashed', linewidth=2)+
  geom_vline(xintercept = median(bike$t1), colour='red', linetype='dotted', linewidth=2)+
  labs(x='the temperature in celsius',
       title = 'The histogram of the temperature in celsius (t1)',
       subtitle = 'The green dashed line=mean(t1) and the red dotted line=median(t1)')

  • The temperature variable shows a concentration of observations roughly between 5 and 20 degrees Celsius, with the mean and median close together around 12.5, reflecting the temperatures most common in London.
# histogram of t2
ggplot(data=bike, aes(x=t2))+
  geom_histogram(fill='cadetblue4', col='grey', bins = 30)+
  scale_x_continuous(n.breaks = 10)+
  scale_y_continuous(n.breaks = 10)+
  geom_vline(xintercept = mean(bike$t2), colour='green', linetype='dashed', linewidth=2)+
  geom_vline(xintercept = median(bike$t2), colour='red', linetype='dotted', linewidth=2)+
  labs(x='the apparent (feels-like) temperature in celsius',
       title = 'The histogram of the apparent (feels-like) temperature in celsius (t2)',
       subtitle = 'The green dashed line=mean(t2) and the red dotted line=median(t2)')

# histogram of hum
ggplot(data=bike, aes(x=hum))+
  geom_histogram(fill='cadetblue4', col='grey', bins = 30)+
  scale_x_continuous(n.breaks = 10)+
  scale_y_continuous(n.breaks = 10)+
  geom_vline(xintercept = mean(bike$hum), colour='green', linetype='dashed', linewidth=2)+
  geom_vline(xintercept = median(bike$hum), colour='red', linetype='dotted', linewidth=2)+
  labs(x='the humidity level',
       title = 'The histogram of the humidity level (hum)',
       subtitle = 'The green dashed line=mean(hum) and the red dotted line=median(hum)')

  • The humidity variable is left-skewed: most observations fall between roughly 60% and 90% humidity, and the mean (72.3) lies slightly below the median (74.5), reflecting the longer tail toward low humidity values.
# histogram of wind_speed
ggplot(data=bike, aes(x=wind_speed))+
  geom_histogram(fill='cadetblue4', col='grey', bins = 30)+
  scale_x_continuous(n.breaks = 10)+
  scale_y_continuous(n.breaks = 10)+
  geom_vline(xintercept = mean(bike$wind_speed), colour='green', linetype='dashed', linewidth=2)+
  geom_vline(xintercept = median(bike$wind_speed), colour='red', linetype='dotted', linewidth=2)+
  labs(x='the wind speed',
       title = 'The histogram of the wind speed (wind_speed)',
       subtitle = 'The green dashed line=mean(wind_speed) and the red dotted line=median(wind_speed)')

  • It shows that the distribution is right-skewed, with the majority of observations concentrated at lower wind speeds. This suggests that lower wind speeds are more common in London. As the wind speed increases, the frequency of observations gradually decreases, indicating that higher wind speeds are less frequent. The histogram provides an overview of the distribution and highlights the most common wind speed ranges in the dataset.

Overall, the numeric features (independent variables) appear closer to normally distributed than the response variable: their means and medians lie near the middle of their ranges, except for humidity, whose mean and median sit toward the right side, coinciding with the most commonly occurring values. We first examine normality more closely with QQ plots, and later perform a correlation analysis to investigate the relationships between variables.
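The skew described above can be quantified with e1071::skewness() (the e1071 library is already loaded). The sketch below uses simulated stand-ins rather than the actual columns, since the exact sample values depend on the data:

```r
library(e1071)

set.seed(1)
right_skewed   <- rexp(1000)    # stand-in for a long-right-tail variable like cnt
roughly_normal <- rnorm(1000)   # stand-in for a roughly symmetric variable like t1

skewness(right_skewed)    # clearly positive, reflecting the long right tail
skewness(roughly_normal)  # close to zero
```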

# QQ plot for cnt
qqnorm(bike$cnt, main = "QQ Plot: Bike Count (cnt)")
qqline(bike$cnt)

  • The points deviate markedly from the diagonal line, curving upward at the upper end. This is consistent with the strong right skew seen in the histogram: most hourly counts are low, with a long tail of very busy hours, so cnt departs clearly from normality.
# QQ plot for t1
qqnorm(bike$t1, main = "QQ Plot: Temperature (t1)")
qqline(bike$t1)

  • This QQ plot reveals a relatively linear relationship between the observed quantiles and the theoretical quantiles, indicating that the data is approximately normally distributed. There are minimal deviations from the diagonal line, suggesting that the temperature variable follows a normal distribution with few outliers or departures from normality. This implies that the distribution of temperature values is consistent with a normal distribution pattern.
# QQ plot for hum
qqnorm(bike$hum, main = "QQ Plot: Humidity (hum)")
qqline(bike$hum)

  • Distribution of humidity values deviates from a normal distribution. It means that the observed humidity values don’t follow the expected pattern based on a normal distribution. Specifically, the plot reveals that the extreme values of humidity are more frequent than what would be expected in a normal distribution. This suggests that there might be outliers or certain factors affecting the humidity levels that cause them to deviate from a normal pattern.
# QQ plot for wind_speed
qqnorm(bike$wind_speed, main = "QQ Plot: Wind Speed")
qqline(bike$wind_speed)

  • The distribution of wind speed values deviates from a normal distribution. The plot shows a noticeable curvature, indicating a departure from linearity. This suggests that the wind speed values are not normally distributed and may be influenced by certain factors or conditions. It is important to consider this non-normality when analyzing the wind speed variable and take appropriate statistical measures accordingly.
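A formal complement to these QQ plots is the Shapiro-Wilk test from base R (note that shapiro.test() accepts at most 5,000 values, so it would have to be run on a random subsample of this dataset). A minimal sketch on simulated data:

```r
set.seed(42)
x <- rnorm(1000)   # genuinely normal data
y <- rexp(1000)    # clearly non-normal data

shapiro.test(x)$p.value   # typically large: no evidence against normality
shapiro.test(y)$p.value   # tiny: normality firmly rejected
```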
# Pie chart for is_holiday
ggplot(data = bike, aes(x = "", fill = is_holiday)) +
  geom_bar(width = 1) +
  coord_polar(theta = "y") +
  xlab("") +
  ylab("") +
  guides(fill = guide_legend(title = "is_holiday")) +
  scale_fill_manual(values = c("cadetblue4", "cyan3"), labels = c("NO", "YES")) +
  labs(title = "Pie Chart - is_holiday")

  • The chart shows that the vast majority of observations (about 98%, or 17,030 of the 17,414 hours) are labeled as “Not a holiday,” while the remaining observations (about 2%, or 384 hours) fall on holidays. The dataset therefore consists almost entirely of non-holiday hours.
# Pie chart for weather_code
ggplot(data = bike, aes(x = "", fill = weather_code)) +
  geom_bar(width = 1) +
  coord_polar(theta = "y") +
  xlab("") +
  ylab("") +
  guides(fill = guide_legend(title = "weather_code")) +
  scale_fill_manual(values = c("cadetblue4", "cyan3", "darkgoldenrod2", "coral2"),
                    labels = c('Clear','Mist/Cloud','Light Rain/Snow','Heavy Rain/Hail/Snow/Fog')) +
  labs(title = "Pie Chart - weather_code")

  • The largest portion of the pie corresponds to weather code 1, which represents “Clear” conditions (6,150 of the 17,414 hours). The remaining segments represent “Mist/Cloud”, “Light Rain/Snow”, and “Heavy Rain/Hail/Snow/Fog”. Clear weather is thus the single most common condition, while worse conditions occur progressively less often.
# Pie chart for is_weekend
ggplot(data = bike, aes(x = "", fill = is_weekend)) +
  geom_bar(width = 1) +
  coord_polar(theta = "y") +
  xlab("") +
  ylab("") +
  guides(fill = guide_legend(title = "is_weekend")) +
  scale_fill_manual(values = c("cadetblue4", "cyan3"), labels = c("NO", "YES")) +
  labs(title = "Pie Chart - is_weekend")

  • The larger section of the pie corresponds to “Not Weekend” (12,444 hours, about 71%), while the smaller section represents “Weekend” observations (4,970 hours, about 29%). The dataset therefore contains a substantially larger proportion of weekday data points, roughly matching the 5:2 ratio of weekdays to weekend days.
# Pie chart for season
ggplot(data = bike, aes(x = "", fill = season)) +
  geom_bar(width = 1) +
  coord_polar(theta = "y") +
  xlab("") +
  ylab("") +
  guides(fill = guide_legend(title = "season")) +
  scale_fill_manual(values = c("cadetblue4", "cyan3", "darkgoldenrod2", "coral2"),labels=c('Spring','Summer','Fall','Winter')) +
  labs(title = "Pie Chart - season")

  • The four seasons are represented almost evenly: “Spring” is only marginally the largest section (4,394 hours), followed closely by “Summer” (4,387), “Winter” (4,330), and “Fall” (4,303). The dataset is therefore essentially balanced across seasons rather than biased toward any one of them.
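The shares read off these pie charts can be checked numerically with table() and prop.table(); a toy sketch (hypothetical factor values, so the output shown here is reproducible):

```r
# proportions behind a pie chart, on a small hypothetical season factor
season_toy <- factor(c(0, 0, 1, 1, 2, 3, 3, 3))
round(prop.table(table(season_toy)), 2)
# 0: 0.25, 1: 0.25, 2: 0.12, 3: 0.38
```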

Now we use box-plots in order to compare each categorical independent variable with the response variable (the number of bike shares in London - cnt):

# boxplot of cnt and is_weekend
ggplot(data=bike, aes(x=is_weekend, y=cnt, fill=is_weekend))+
  geom_boxplot(show.legend = F)+
  scale_y_continuous(n.breaks = 10)+
  scale_x_discrete(breaks=c(0,1), labels=c('No','Yes'))+
  theme_bw(base_size = 15)+
  labs(x='whether or not the day is a weekend',y="the total number of bike shares in London",
       title = 'The boxplot of total number of bike shares in London for weekend indicator')

The median for weekends is lower compared to non-weekends, indicating that the median number of bike rentals is generally lower on weekends. The box for weekends is smaller than the box for non-weekends, suggesting less variability in bike rentals on weekends.
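The median comparison read off this boxplot can be confirmed with a dplyr group summary; a sketch on toy data (hypothetical counts, chosen only to show the pattern):

```r
library(dplyr)

toy <- data.frame(
  is_weekend = factor(c(0, 0, 0, 1, 1, 1)),
  cnt        = c(900, 1200, 1500, 300, 500, 700)
)

# group-wise medians behind the boxplot comparison
toy %>%
  group_by(is_weekend) %>%
  summarise(median_cnt = median(cnt), .groups = "drop")
# weekday (0) median 1200, weekend (1) median 500
```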

# boxplot of cnt and is_holiday
ggplot(data=bike, aes(x=is_holiday, y=cnt, fill=is_holiday))+
  geom_boxplot(show.legend = F)+
  scale_y_continuous(n.breaks = 10)+
  scale_x_discrete(breaks=c(0,1), labels=c('No','Yes'))+
  labs(x='whether or not the day is a holiday',y="the total number of bike shares in London",
       title = 'The boxplot of total number of bike shares in London for holiday indicator')

  • The median for holidays is lower than for non-holidays, indicating that typical bike rental counts are lower on holidays.
  • The holiday group is far smaller (384 of the 17,414 hours), so its box summarizes many fewer observations and should be read with that in mind.
# boxplot of cnt and season
ggplot(data=bike, aes(x=season, y=cnt, fill=season))+
  geom_boxplot(show.legend = F)+
  scale_y_continuous(n.breaks = 10)+
  scale_x_discrete(breaks=c(0,1,2,3), labels=c('Spring','Summer','Fall','Winter'))+
  labs(x='season', y="the total number of bike shares in London",
       title = 'The boxplot of total number of bike shares in London for different seasons')

  • The “Spring” and “Summer” seasons have similar medians and IQRs, indicating comparable bike rental patterns.
  • The “Fall” season has a slightly higher median and a larger IQR, suggesting a wider range of bike rentals during this season.
  • The “Winter” season has the lowest median and the smallest IQR, indicating relatively lower bike rentals and a narrower range of values.
# boxplot of cnt and weather_code
ggplot(data=bike, aes(x=weather_code, y=cnt, fill=weather_code))+
  geom_boxplot(show.legend = F)+
  scale_y_continuous(n.breaks = 10)+
  scale_x_discrete(breaks=c(1,2,3,4), labels=c('Clear','Mist/Cloud','Light Rain/Snow','Heavy Rain/Hail/Snow/Fog'))+
  labs(x='the weather situation', y="the total number of bike shares in London",
       title = 'The boxplot of total number of bike shares in London for different weather situations')

  • Weather situation 2 (Mist/Cloud) has the highest median among all the weather codes, indicating that this weather condition is associated with higher bike rental counts on average.
  • Weather situation 1 (Clear) and weather situation 3 (Light Rain/Snow) have similar medians, suggesting that these conditions are associated with relatively lower bike rental counts compared to weather situation 2.
  • Weather situation 4 (Heavy Rain/Hail/Snow/Fog) has the lowest median, indicating that this weather condition is associated with the lowest bike rental counts on average.
# boxplot of cnt and year
ggplot(data=bike, aes(x=year, y=cnt, fill=year))+
  geom_boxplot(show.legend = F)+
  scale_y_continuous(n.breaks = 10)+
  labs(x='year', y="the total number of bike shares in London",
       title = 'The boxplot of total number of bike shares in London for different years')

# boxplot of cnt and month
ggplot(data=bike, aes(x=month, y=cnt, fill=month))+
  geom_boxplot(show.legend = F)+
  scale_y_continuous(n.breaks = 10)+
  labs(x='month', y="the total number of bike shares in London",
       title = 'The boxplot of total number of bike shares in London for different months')

  • The median bike share count appears to be relatively consistent across the months, with slight variations.
  • The range of bike share counts is wider in some months compared to others, as indicated by the length of the whiskers.
  • There are a few outliers in certain months, suggesting unusual instances of exceptionally high or low bike share counts.
# boxplot of cnt and day
ggplot(data=bike, aes(x=day, y=cnt, fill=day))+
  geom_boxplot(show.legend = F)+
  scale_y_continuous(n.breaks = 10)+
  labs(x='day', y="the total number of bike shares in London",
       title = 'The boxplot of total number of bike shares in London for different days')

# boxplot of cnt and hour
ggplot(data=bike, aes(x=hour, y=cnt, fill=hour))+
  geom_boxplot(show.legend = F)+
  scale_y_continuous(n.breaks = 10)+
  labs(x='hour', y="the total number of bike shares in London",
       title = 'The boxplot of total number of bike shares in London for different hours')

  • The median bike share count tends to be higher during the morning and evening rush hours, specifically between 7 and 9 and between 16 and 19. This suggests greater demand for bike rentals during peak commuting hours.
  • The counts during the late night and early morning hours (between midnight and 5) are relatively lower, indicating reduced bike rental activity during these periods.
  • The range of counts varies across hours, with wider ranges during the peak hours and narrower ranges during the off-peak hours.
# create another categorical variable to discretize the hours
h <- bike$hour %>% as.numeric()
# lower bounds are inclusive so that hours 6, 10, and 17 fall into the intended bins
hour_cat <- ifelse(h >= 6 & h < 10, 'between_6_and_10',
                   ifelse(h >= 10 & h < 17, 'between_10_and_17',
                          ifelse(h >= 17 & h < 20,'between_17_and_20', 
                                 'between_20_and_6')))
# add hour_cat to the data, replacing the original hour column
bike <- bike %>% mutate(hour=as.factor(hour_cat))
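The binning rule can also be checked in isolation by wrapping it in a small base-R helper with explicit interval boundaries (lower bound inclusive, upper bound exclusive); the function name and sample hours below are our own:

```r
# Map an hour of day (0-23) to the four intervals used above
bin_hour <- function(h) {
  ifelse(h >= 6 & h < 10, "between_6_and_10",
         ifelse(h >= 10 & h < 17, "between_10_and_17",
                ifelse(h >= 17 & h < 20, "between_17_and_20",
                       "between_20_and_6")))
}
# boundary hours land in the expected bins
bin_hour(c(5, 6, 10, 17, 20))
```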
# boxplot of cnt and hour categories
ggplot(data=bike, aes(x=hour, y=cnt, fill=hour))+
  geom_boxplot(show.legend = F)+
  scale_y_continuous(n.breaks = 10)+
  labs(x='hour intervals', y="the total number of bike shares in London",
       title = 'The boxplot of total number of bike shares in London for different hour intervals')

  • The “between_17_and_20” hour interval exhibits a higher number of bike share counts, and the median line is positioned in the middle of the box, indicating a relatively balanced distribution of bike rentals during this time period.
  • The “between_6_and_10” hour interval exhibits a wide range of bike share counts, but the median value suggests that the majority of rentals during this time period are relatively low.

Check Correlation

numerical_vars <- bike %>% select(cnt:wind_speed) %>% mutate(cnt=as.numeric(cnt))
chart.Correlation(numerical_vars, histogram = F, method = "pearson")

The numeric features (independent variables) appear more normally distributed than the response variable: their means and medians sit near the middle of the value range. The exception is humidity, where the mean and median lie toward the right side, coinciding with the most commonly occurring values. Having examined the distributions of the numerical variables, we now perform a correlation analysis to investigate the relationships between them.

# Calculate correlation matrix
cor_matrix <- cor(numerical_vars)


# Convert correlation matrix to long format
cor_data <- reshape2::melt(cor_matrix)

# Plot heatmap with correlation values using dark mode colors
ggplot(cor_data, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile(color = "black") +
  scale_fill_gradient2(low = "white", mid = "red", high = "black", 
                       midpoint = 0, limits = c(-1, 1), name = "Correlation") +
  geom_text(aes(label = round(value, 2)), size = 6, color = "white") +
  labs(x = "", y = "") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        panel.background = element_rect(fill = "white"),
        plot.title = element_text(color = "green"),
        axis.text = element_text(color = "black"),
        legend.title = element_text(color = "black"),
        legend.text = element_text(color = "black"))

  • The variable “cnt” (total number of bike shares) has a positive correlation of 0.39 with “t1” (temperature), indicating a weak positive relationship. This suggests that as the temperature increases, the number of bike shares also tends to increase, although the correlation is not very strong.

  • The variable “cnt” has a negative correlation of -0.46 with “hum” (humidity), indicating a moderate negative relationship. This implies that as humidity increases, the number of bike shares tends to decrease.

  • The variable “cnt” has a weak positive correlation of 0.12 with “wind_speed”. This indicates that the number of bike shares may increase slightly as wind speed increases, although the relationship is too weak to be practically meaningful.
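Whether a sample correlation of a given size is statistically distinguishable from zero can be checked with cor.test(); a quick sketch on synthetic data (the variables here are made up, not the bike data):

```r
# Pearson correlation test on synthetic data with a weak true relationship
set.seed(1)
x <- rnorm(200)
y <- 0.12 * x + rnorm(200)
ct <- cor.test(x, y, method = "pearson")
ct$estimate  # sample Pearson correlation
ct$p.value   # p-value for H0: true correlation = 0
```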

3-3 Exclude Unused Variables from the Data

All the variables can be useful in this analysis except timestamp and t2, so these two are excluded from further analyses.

Exclude ‘t2’ feature.

Let’s first take a look at both features on a line graph:

# Convert timestamp to POSIXct format
bike$timestamp <- as.POSIXct(bike$timestamp)

# Create a line graph for the whole dataset
ggplot(bike, aes(x = timestamp)) +
  geom_line(aes(y = t1, color = "t1"), linetype = "solid") +
  geom_line(aes(y = t2, color = "t2"), linetype = "solid") +
  xlab("Timestamp") +
  ylab("Temperature (Celsius)") +
  scale_color_grey() +
  labs(color = "Temperature Feature") +
  theme_classic()

# Filter data for 2016
bike_2016 <- filter(bike, year == 2016)

# Create a line graph for 2016
ggplot(bike_2016, aes(x = timestamp)) +
  geom_line(aes(y = t1, color = "t1"), linetype = "solid") +
  geom_line(aes(y = t2, color = "t2"), linetype = "solid") +
  xlab("Timestamp") +
  ylab("Temperature (Celsius)") +
  scale_color_grey() +
  labs(color = "Temperature Feature") +
  theme_classic()

# Filter data for Jan - Mar 2016
bike_jan_mar_2016 <- filter(bike, year == 2016, month %in% c(1, 2, 3))

# Create a line graph for Jan - Mar 2016
ggplot(bike_jan_mar_2016, aes(x = timestamp)) +
  geom_line(aes(y = t1, color = "t1"), linetype = "solid") +
  geom_line(aes(y = t2, color = "t2"), linetype = "solid") +
  xlab("Timestamp") +
  ylab("Temperature (Celsius)") +
  scale_color_grey() +
  labs(color = "Temperature Feature") +
  theme_classic()

From the provided line graph, we can observe the following:

  • Temperature Variation: The graph shows the variation of temperature over time. The solid black line represents the temperature (t1) and the gray line represents the apparent temperature (t2).

  • Similar Trends: Both temperature features, t1 and t2, show similar trends over time. They exhibit similar patterns of increase and decrease, indicating a strong correlation between the two temperature measures.

  • Magnitude Differences: However, there is a noticeable difference in magnitude between t1 and t2: the t1 values generally sit above the t2 values. This is expected, since t1 is the measured air temperature while t2 is the perceived or “feels-like” temperature, which effects such as wind chill typically push below the measured value.

  • Seasonal Patterns: The graph also reveals seasonal patterns in temperature. There are peaks and troughs that occur at regular intervals, suggesting temperature fluctuations corresponding to different seasons.

Overall, the graph provides insights into the temperature variations over time, highlighting the similarities and differences between the t1 and t2 temperature measures. It can be useful for understanding temperature patterns, identifying trends, and making comparisons between different temperature measurements.

  • Because t2 closely tracks t1 while t1 is the direct temperature measurement, keeping both features would add redundancy (and collinearity) rather than information. The decision was therefore made to drop the t2 feature from the analysis to ensure data consistency.
# exclude t2
bike <- bike %>% select(-t2)

Exclude the Year 2017

Because our dataset has significantly fewer data points for 2017 than for 2015 and 2016, we excluded the year 2017 from our analysis. We considered the following factors when making this decision:

  • Data imbalance: removing 2017 helps mitigate the imbalance and provides a more balanced dataset.
  • Generalizability: excluding this year allows our model to focus on learning patterns from the more complete data for 2015 and 2016, potentially leading to better generalization.

# Count the number of data points for each year
year_counts <- table(bike$year)

# Print the counts
print(year_counts)
## 
## 2015 2016 2017 
## 8643 8699   72
# Create a bar plot of the counts
barplot(year_counts, main = "Number of Data Points by Year", xlab = "Year", ylab = "Count")

# Count the number of unique days per year
days_per_year <- bike %>% 
  group_by(year) %>% 
  summarise(number_of_days = n_distinct(date(timestamp)))

# Print the number of days per year
print(days_per_year)
## # A tibble: 3 × 2
##   year  number_of_days
##   <fct>          <int>
## 1 2015             362
## 2 2016             365
## 3 2017               3
# Filter data to exclude 2017
bike <- bike %>% filter(year != 2017)
  • Take a look at the dataset again
# data summarization
skim(bike)
Data summary
Name bike
Number of rows 17342
Number of columns 13
_______________________
Column type frequency:
factor 8
numeric 4
POSIXct 1
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
weather_code 0 1 FALSE 4 1: 6120, 2: 4027, 4: 3657, 3: 3538
is_holiday 0 1 FALSE 2 0: 16982, 1: 360
is_weekend 0 1 FALSE 2 0: 12396, 1: 4946
season 0 1 FALSE 4 0: 4394, 1: 4387, 2: 4303, 3: 4258
year 0 1 FALSE 2 201: 8699, 201: 8643, 201: 0
month 0 1 FALSE 12 5: 1488, 8: 1484, 12: 1484, 7: 1481
day 0 1 FALSE 31 6: 576, 14: 576, 21: 576, 22: 576
hour 0 1 FALSE 4 bet: 9377, bet: 4348, bet: 2167, bet: 1450

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
cnt 0 1 1145.67 1085.92 0.0 260.0 846.0 1676.0 7860.0 ▇▂▁▁▁
t1 0 1 12.50 5.56 -1.5 8.5 12.5 16.0 34.0 ▂▇▇▂▁
hum 0 1 72.28 14.32 20.5 63.0 74.5 83.0 100.0 ▁▂▅▇▅
wind_speed 0 1 15.92 7.90 0.0 10.0 15.0 20.5 56.5 ▆▇▃▁▁

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
timestamp 0 1 2015-01-04 2016-12-31 23:00:00 2016-01-02 03:30:00 17342
# exclude timestamp
bike <- bike %>% select(-timestamp)
# Check features again
nm <- colSums(is.na(bike))
# convert missing count information to a table
tibble(Variable=names(nm), Number_of_Missing=as.vector(nm)) %>% 
  knitr::kable()
Variable Number_of_Missing
cnt 0
t1 0
hum 0
wind_speed 0
weather_code 0
is_holiday 0
is_weekend 0
season 0
year 0
month 0
day 0
hour 0

5. Regression Modeling

In this section, we train a regression model on the data in order to predict the number of bike shares in London from the independent variables. First, we divide the data into training and test sets as below:

set.seed(123)
train_index <- createDataPartition(bike$cnt, p = 0.8, list = FALSE)
train_data <- bike[train_index, ]
test_data <- bike[-train_index, ]

Now we use feature scaling and scale the numerical variables as below:

numerical_train <- train_data %>% select(cnt:wind_speed)
numerical_test <- test_data %>% select(cnt:wind_speed)
# fit the centering/scaling parameters on the training data only and apply
# them to both sets, so no test-set information leaks into preprocessing
bike_proc_tr <- preProcess(numerical_train, method = c("center", "scale"))
bike_scaled_tr <- predict(bike_proc_tr, numerical_train)
bike_scaled_te <- predict(bike_proc_tr, numerical_test)
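A note on the convention here: centering and scaling parameters are generally estimated on the training data and then reused on the test data, so that no information from the test set leaks into preprocessing. A base-R illustration with made-up values:

```r
# Scale a test column using the TRAINING mean and standard deviation
train_col <- c(10, 12, 14, 16, 18)
test_col  <- c(11, 15)
mu    <- mean(train_col)  # 14
sigma <- sd(train_col)    # sqrt(10)
scaled_test <- (test_col - mu) / sigma
scaled_test  # approx -0.949 and 0.316
```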

The scaled data are combined with the categorical variables as below:

train_cat <- train_data %>% select_if(is.factor)
test_cat <- test_data %>% select_if(is.factor)
train_final <- cbind.data.frame(bike_scaled_tr, train_cat)
test_final <- cbind.data.frame(bike_scaled_te, test_cat)

Now a linear model is fitted to the training data:

# Fit linear model
fit1 <- lm(cnt ~ ., data = train_final)

# Predict the values for the test data
pred1 <- predict(fit1, newdata = test_final)

# Create a data frame for observed and predicted values
df1 <- data.frame(Observed = test_final$cnt, Predicted = pred1)

# Create the scatter plot with regression line
ggplot(data = df1, aes(x = Observed, y = Predicted)) +
  geom_point(col = 'cadetblue4') +
  geom_smooth(method = 'lm', col = 'red', se = FALSE) +  # Add the regression line
  scale_x_continuous(n.breaks = 10) +
  scale_y_continuous(n.breaks = 10) +
  labs(title = 'The scatter plot of bike shares predictions using linear regression model')

There’s a definite diagonal trend, and the intersections of the predicted and actual values are generally following the path of the trend line; but there’s a fair amount of difference between the ideal function represented by the line and the results. This variance represents the residuals of the model - in other words, the difference between the label predicted when the model applies the coefficients it learned during training to the validation data, and the actual value of the validation label. We can quantify the residuals by calculating a number of commonly used evaluation metrics.

The commonly used criteria MSE, RMSE, and R-squared are therefore calculated to evaluate the model:

mse <- mean((df1$Observed-df1$Predicted)^2)
cat(paste('The Mean Squared Error is', mse), sep = '\n')
## The Mean Squared Error is 0.464380947836642
rmse <- sqrt(mse)
cat(paste('The Root Mean Squared Error is', rmse), sep = '\n')
## The Root Mean Squared Error is 0.681455022607246
r_squared <- summary(fit1)$r.squared  # note: R-squared of the fit on the training data
cat(paste('The Coefficient of Determination is', r_squared), sep = '\n')
## The Coefficient of Determination is 0.511898921541363
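Since the same three criteria recur for every model in this report, they can be bundled into one helper function (a sketch; the function name is our own):

```r
# Test-set MSE, RMSE, and R-squared for a vector of predictions
regression_metrics <- function(observed, predicted) {
  mse <- mean((observed - predicted)^2)
  c(MSE = mse,
    RMSE = sqrt(mse),
    R_squared = 1 - mse / mean((observed - mean(observed))^2))
}
regression_metrics(c(3, 5, 7), c(2, 5, 8))  # MSE 2/3, RMSE ~0.816, R-squared 0.75
```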

The model has some predictive power, but more flexible models, such as tree-based ensembles, may perform better.

6. Decision Tree Modeling

Decision tree modeling is a popular machine learning algorithm used for classification and regression analysis. It is a graphical representation of all possible outcomes of a decision based on certain conditions or variables. The decision tree consists of nodes, branches, and leaves. The nodes represent the conditions or variables, the branches represent the possible outcomes, and the leaves represent the final decision or prediction.

The decision tree algorithm works by recursively splitting the data into smaller subsets based on the most significant variable that can best separate the data into different classes. This process continues until all data points are classified correctly or until a stopping criterion is met.
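For regression ("anova") trees, the "best" split on a numeric feature is typically the threshold that minimizes the summed squared error of the two resulting groups. A minimal base-R sketch of that search (function and data are illustrative, not the rpart internals):

```r
# Find the threshold on x that minimizes total within-group squared error of y
best_split <- function(x, y) {
  sse <- function(v) sum((v - mean(v))^2)
  thresholds <- sort(unique(x))[-1]   # candidate split points
  scores <- sapply(thresholds, function(t) sse(y[x < t]) + sse(y[x >= t]))
  thresholds[which.min(scores)]
}
x <- c(1, 2, 3, 10, 11, 12)
y <- c(5, 6, 5, 20, 21, 20)
best_split(x, y)  # 10: splits cleanly between the two clusters
```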

# fit decision tree model
fit2 <- rpart(cnt~., data=train_final, method = "anova")

library(rpart.plot)
# Plot the decision tree
rpart.plot(fit2, main = "Decision Tree")

# predict on the test data (predict.rpart ignores a method argument)
pred2 <- predict(fit2, newdata = test_final)
# plot the predicted vs observed in the test set
df2 <- data.frame(Observed=test_final$cnt, Predicted=as.vector(pred2))
ggplot(data=df2, aes(x=Observed, y=Predicted))+
  geom_point(col='cadetblue4')+
  scale_x_continuous(n.breaks = 10)+
  scale_y_continuous(n.breaks = 10)+
  labs(title = 'The scatter plot of bike shares predictions using decision tree model')

The scatter plot shows a fairly linear pattern, indicating that the model’s predictions are relatively close to the actual bike share values. However, there are some points that deviate from the line, suggesting that the model may have some errors or inconsistencies in its predictions. Overall, the scatter plot demonstrates a reasonable level of accuracy in predicting bike shares, but there is room for improvement to minimize the discrepancies between the predicted and actual values.

Again, the criteria calculated for the linear regression model are evaluated for the decision tree model so that the performance of the two models can be compared:

mse2 <- mean((df2$Observed-df2$Predicted)^2)
cat(paste('The Mean Squared Error is', mse2), sep='\n')
## The Mean Squared Error is 0.413126563278651
rmse2 <- sqrt(mse2)
cat(paste('The Root Mean Squared Error is', rmse2), sep='\n')
## The Root Mean Squared Error is 0.64274922269782
# Calculate R-squared
ssr2 <- sum((df2$Predicted - mean(df2$Observed))^2)
sst2 <- sum((df2$Observed - mean(df2$Observed))^2)
r_squared2 <- 1 - (ssr2 / sst2)
cat(paste('The Coefficient of Determination (R-squared) is', r_squared2), sep='\n')
## The Coefficient of Determination (R-squared) is 0.397921861651633

7. XGBoost Modeling

set.seed(123)
target_variable <- "cnt"
# Split the data into input features (X) and target variable (y)
X_train <- train_final %>% select(-target_variable)
y_train <- train_final[[target_variable]]
X_test <- test_final %>% select(-target_variable)
y_test <- test_final[[target_variable]]
# Convert factor variables to numeric
X_train <- X_train %>%
  mutate_if(is.factor, as.numeric)
X_test <- X_test %>%
  mutate_if(is.factor, as.numeric)
# Convert the data to the xgboost format
dtrain <- xgb.DMatrix(data = as.matrix(X_train), label = y_train)
dtest <- xgb.DMatrix(data = as.matrix(X_test), label = y_test)

# Define the parameters for the XGBoost model
# (nrounds is passed to xgb.train directly, not through this list)
params <- list(
  objective = "reg:squarederror",  # Regression objective function
  eval_metric = "rmse"  # Evaluation metric: Root Mean Squared Error
)
# Train the XGBoost model for 100 boosting rounds, monitoring RMSE
# on both the training and the test set
xgb_model <- xgb.train(params = params, data = dtrain, nrounds = 100, watchlist = list(train = dtrain, test = dtest))
## [1]  train-rmse:0.897027 test-rmse:0.891035 
## [2]  train-rmse:0.756121 test-rmse:0.744858 
## [3]  train-rmse:0.673883 test-rmse:0.663039 
## [4]  train-rmse:0.622357 test-rmse:0.614597 
## [5]  train-rmse:0.591860 test-rmse:0.585529 
## [6]  train-rmse:0.574501 test-rmse:0.571201 
## [7]  train-rmse:0.561882 test-rmse:0.561953 
## [8]  train-rmse:0.551891 test-rmse:0.554967 
## [9]  train-rmse:0.543468 test-rmse:0.549252 
## [10] train-rmse:0.537497 test-rmse:0.544000 
## [11] train-rmse:0.533137 test-rmse:0.540871 
## [12] train-rmse:0.529504 test-rmse:0.539054 
## [13] train-rmse:0.523357 test-rmse:0.534668 
## [14] train-rmse:0.519434 test-rmse:0.531948 
## [15] train-rmse:0.515827 test-rmse:0.531144 
## [16] train-rmse:0.512211 test-rmse:0.528391 
## [17] train-rmse:0.508699 test-rmse:0.526213 
## [18] train-rmse:0.503330 test-rmse:0.526335 
## [19] train-rmse:0.501280 test-rmse:0.525991 
## [20] train-rmse:0.496877 test-rmse:0.525385 
## [21] train-rmse:0.494660 test-rmse:0.523912 
## [22] train-rmse:0.492708 test-rmse:0.524263 
## [23] train-rmse:0.491682 test-rmse:0.523317 
## [24] train-rmse:0.487720 test-rmse:0.521902 
## [25] train-rmse:0.485089 test-rmse:0.521223 
## [26] train-rmse:0.483168 test-rmse:0.521004 
## [27] train-rmse:0.481945 test-rmse:0.520440 
## [28] train-rmse:0.479248 test-rmse:0.520937 
## [29] train-rmse:0.476716 test-rmse:0.521916 
## [30] train-rmse:0.475314 test-rmse:0.522120 
## [31] train-rmse:0.474000 test-rmse:0.522068 
## [32] train-rmse:0.470682 test-rmse:0.520938 
## [33] train-rmse:0.469324 test-rmse:0.520651 
## [34] train-rmse:0.467817 test-rmse:0.520100 
## [35] train-rmse:0.467408 test-rmse:0.520229 
## [36] train-rmse:0.467073 test-rmse:0.519929 
## [37] train-rmse:0.464584 test-rmse:0.520645 
## [38] train-rmse:0.463744 test-rmse:0.520933 
## [39] train-rmse:0.461735 test-rmse:0.521289 
## [40] train-rmse:0.459794 test-rmse:0.520643 
## [41] train-rmse:0.459070 test-rmse:0.520309 
## [42] train-rmse:0.457798 test-rmse:0.520234 
## [43] train-rmse:0.456392 test-rmse:0.520037 
## [44] train-rmse:0.454147 test-rmse:0.519519 
## [45] train-rmse:0.453785 test-rmse:0.519326 
## [46] train-rmse:0.452839 test-rmse:0.519608 
## [47] train-rmse:0.452566 test-rmse:0.519617 
## [48] train-rmse:0.451560 test-rmse:0.519361 
## [49] train-rmse:0.450378 test-rmse:0.519297 
## [50] train-rmse:0.448348 test-rmse:0.519114 
## [51] train-rmse:0.447490 test-rmse:0.519134 
## [52] train-rmse:0.446692 test-rmse:0.519956 
## [53] train-rmse:0.445891 test-rmse:0.519427 
## [54] train-rmse:0.443463 test-rmse:0.518809 
## [55] train-rmse:0.441752 test-rmse:0.518524 
## [56] train-rmse:0.440224 test-rmse:0.518557 
## [57] train-rmse:0.439828 test-rmse:0.518510 
## [58] train-rmse:0.439483 test-rmse:0.518447 
## [59] train-rmse:0.438998 test-rmse:0.518455 
## [60] train-rmse:0.438799 test-rmse:0.518474 
## [61] train-rmse:0.437833 test-rmse:0.518759 
## [62] train-rmse:0.435421 test-rmse:0.518799 
## [63] train-rmse:0.433532 test-rmse:0.518648 
## [64] train-rmse:0.432876 test-rmse:0.519057 
## [65] train-rmse:0.430204 test-rmse:0.519137 
## [66] train-rmse:0.428878 test-rmse:0.518653 
## [67] train-rmse:0.427847 test-rmse:0.519388 
## [68] train-rmse:0.426643 test-rmse:0.519559 
## [69] train-rmse:0.425878 test-rmse:0.519793 
## [70] train-rmse:0.424928 test-rmse:0.519536 
## [71] train-rmse:0.424003 test-rmse:0.519524 
## [72] train-rmse:0.422671 test-rmse:0.519563 
## [73] train-rmse:0.420626 test-rmse:0.519847 
## [74] train-rmse:0.419192 test-rmse:0.519965 
## [75] train-rmse:0.418654 test-rmse:0.520061 
## [76] train-rmse:0.417579 test-rmse:0.520812 
## [77] train-rmse:0.416525 test-rmse:0.520674 
## [78] train-rmse:0.415101 test-rmse:0.520903 
## [79] train-rmse:0.413527 test-rmse:0.520898 
## [80] train-rmse:0.411983 test-rmse:0.520631 
## [81] train-rmse:0.411411 test-rmse:0.520728 
## [82] train-rmse:0.410229 test-rmse:0.521120 
## [83] train-rmse:0.409255 test-rmse:0.521298 
## [84] train-rmse:0.408776 test-rmse:0.521384 
## [85] train-rmse:0.408161 test-rmse:0.521402 
## [86] train-rmse:0.407570 test-rmse:0.521492 
## [87] train-rmse:0.407352 test-rmse:0.521392 
## [88] train-rmse:0.405752 test-rmse:0.521559 
## [89] train-rmse:0.404376 test-rmse:0.521863 
## [90] train-rmse:0.403122 test-rmse:0.521659 
## [91] train-rmse:0.401176 test-rmse:0.521263 
## [92] train-rmse:0.400555 test-rmse:0.521104 
## [93] train-rmse:0.399100 test-rmse:0.521280 
## [94] train-rmse:0.398417 test-rmse:0.521640 
## [95] train-rmse:0.397836 test-rmse:0.521738 
## [96] train-rmse:0.396902 test-rmse:0.521686 
## [97] train-rmse:0.396463 test-rmse:0.521667 
## [98] train-rmse:0.395456 test-rmse:0.521585 
## [99] train-rmse:0.394861 test-rmse:0.521927 
## [100]    train-rmse:0.393967 test-rmse:0.521700
# Make predictions on the test set
predictions <- predict(xgb_model, newdata = dtest)

Based on the output, the train RMSE decreases steadily from 0.897027 to 0.393967, showing that the model keeps improving its fit to the training data. The test RMSE falls quickly to about 0.52 and then plateaus, reaching its best value (roughly 0.5184) near round 58 before drifting slightly upward. The widening gap between train and test RMSE after that point indicates mild overfitting, and suggests that stopping the boosting around that round would have been sufficient.
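In practice one rarely trains for a fixed 100 rounds: xgboost's early_stopping_rounds option halts training once the test metric has not improved for a set number of rounds. A base-R sketch of that stopping rule applied to a made-up RMSE trace (function name and numbers are illustrative):

```r
# Round at which training would stop: the test metric (lower is better)
# has not improved on its best value for `patience` consecutive rounds
early_stop_round <- function(metric, patience = 10) {
  best <- Inf; best_round <- 0
  for (i in seq_along(metric)) {
    if (metric[i] < best) { best <- metric[i]; best_round <- i }
    if (i - best_round >= patience) return(i)
  }
  length(metric)  # early stopping never triggered
}
trace <- c(0.90, 0.70, 0.60, 0.55, 0.53, 0.53, 0.54, 0.54, 0.55, 0.56)
early_stop_round(trace, patience = 4)  # stops at round 9
```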

# Evaluate the model
mse4 <- mean((predictions - y_test)^2)
rmse4 <- sqrt(mse4)
r_squared4 <- 1 - mse4 / var(y_test)

# Print the evaluation metrics
print(paste("MSE:", mse4))
## [1] "MSE: 0.272171180123888"
print(paste("RMSE:", rmse4))
## [1] "RMSE: 0.521700278056173"
print(paste("R-squared:", r_squared4))
## [1] "R-squared: 0.727828819876112"

Based on the evaluation metrics, both the XGBoost and Random Forest models (the latter reported in the next section) perform well in predicting the target variable (cnt). The XGBoost model achieved a lower MSE of 0.2721712 and RMSE of 0.5217003, indicating more accurate predictions than the Random Forest model, which had an MSE of 0.2996687 and RMSE of 0.5474200. Both models also achieved relatively high R-squared values: 0.7278288 for XGBoost and 0.7003313 for Random Forest. These results suggest that both models capture a significant portion of the variance in cnt and exhibit good predictive performance, with XGBoost showing a slight edge in this scenario.

8. Random Forest Modeling

Prepare the data

set.seed(123)
train_index <- createDataPartition(bike$cnt, p = 0.8, list = FALSE)
train_data <- bike[train_index, ]
test_data <- bike[-train_index, ]

numerical_train <- train_data %>% select(cnt:wind_speed)
numerical_test <- test_data %>% select(cnt:wind_speed)
# as before, fit the scaling parameters on the training data only
bike_proc_tr <- preProcess(numerical_train, method = c("center", "scale"))
bike_scaled_tr <- predict(bike_proc_tr, numerical_train)
bike_scaled_te <- predict(bike_proc_tr, numerical_test)

train_cat <- train_data %>% select_if(is.factor)
test_cat <- test_data %>% select_if(is.factor)
train_final <- cbind.data.frame(bike_scaled_tr, train_cat)
test_final <- cbind.data.frame(bike_scaled_te, test_cat)
target_variable <- "cnt"

# Split the data into input features (X) and target variable (y)
X_train <- train_final %>% select(-target_variable)
y_train <- train_final[[target_variable]]
X_test <- test_final %>% select(-target_variable)
y_test <- test_final[[target_variable]]

# Train the Random Forest model
rf_model <- randomForest(x = X_train, y = y_train, ntree = 100, importance = TRUE)

# Make predictions on the test set
predictions <- predict(rf_model, newdata = X_test)

Evaluate our Random Forest Model

mse5 <- mean((predictions - y_test)^2)
rmse5 <- sqrt(mse5)
r_squared5 <- 1 - mse5 / var(y_test)

print(paste("MSE:", mse5))
## [1] "MSE: 0.299668665013543"
print(paste("RMSE:", rmse5))
## [1] "RMSE: 0.547420007867399"
print(paste("R-squared:", r_squared5))
## [1] "R-squared: 0.700331334986457"

Our results show only a minor difference between Random Forest and XGBoost, with XGBoost performing slightly better.

9. Compare Models

The criteria MSE, RMSE, and R-squared can be used to compare the performance of the four models. The following code collects the calculated criteria for all the models in one table:

# Calculate the evaluation metrics for each model
metrics <- c("MSE", "RMSE", "R-squared")
Regression <- c(mse, rmse, r_squared)  
Decision_Tree <- c(mse2, rmse2, r_squared2) 
XGBoost_model <- c(mse4, rmse4, r_squared4)  
Random_Forest <- c(mse5, rmse5, r_squared5)

# Create the tibble/table (a Metric column, plus one column per model)
result_table <- tibble(Metric = metrics, 
                       Regression = Regression, 
                       Decision_Tree = Decision_Tree, 
                       XGBoost = XGBoost_model, 
                       Random_Forest = Random_Forest)

print(result_table)
## # A tibble: 3 × 5
##   Metric    Regression Decision_Tree XGBoost Random_Forest
##   <chr>          <dbl>         <dbl>   <dbl>         <dbl>
## 1 MSE            0.464         0.413   0.272         0.300
## 2 RMSE           0.681         0.643   0.522         0.547
## 3 R-squared      0.512         0.398   0.728         0.700
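Reading the winner off the table can also be done programmatically; a small sketch using the RMSE values reported above:

```r
# Lowest test RMSE wins among the four models
rmse_by_model <- c(Regression = 0.681, Decision_Tree = 0.643,
                   XGBoost = 0.522, Random_Forest = 0.547)
names(which.min(rmse_by_model))  # "XGBoost"
```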

The XGBoost model achieved an R-squared value of 72.78%, indicating that approximately 72.78% of the variance in the target variable can be explained by the features included in the model. This suggests a moderately strong relationship between the independent variables and the dependent variable.

10. Variable Importance in XGBoost

Variable importance in a tree-based model such as XGBoost measures how much each input variable contributes to the accuracy of the model. It is a way of identifying which variables matter most for predicting the outcome variable.

# Calculate variable importance
importance <- xgb.importance(feature_names = colnames(X_train), model = xgb_model)
print(importance)
##          Feature        Gain       Cover   Frequency
##  1:         hour 0.318893907 0.098464224 0.102457535
##  2:          hum 0.264324154 0.204016385 0.184857246
##  3:   is_weekend 0.159577459 0.020991354 0.038127936
##  4:           t1 0.112384799 0.231179158 0.179074810
##  5:   wind_speed 0.031683097 0.130370876 0.170581858
##  6:        month 0.028583157 0.100347618 0.070473437
##  7: weather_code 0.027189440 0.042196155 0.061438381
##  8:          day 0.026454591 0.119448488 0.131550416
##  9:   is_holiday 0.013133724 0.009960921 0.008312252
## 10:       season 0.011069208 0.026919521 0.029815685
## 11:         year 0.006706465 0.016105298 0.023310445
# Plot variable importance
xgb.plot.importance(importance_matrix = importance,
                    top_n = 10,  # Show top 10 important features
                    xlab = "Feature Importance ", 
                    ylab = " ",  
                    main = "XGBoost Variable Importance Plot",  # Plot title
                    col = "steelblue",  
                    horizontal = TRUE)

Gain measures the improvement in accuracy (loss reduction) contributed by the splits that use a feature; it is the primary indicator of predictive power.

Cover measures the relative number of observations affected by splits on a feature.

Frequency measures how often a feature is used for splitting across all trees.
